Markov Chain Monte Carlo Sampling of Gene Genealogies Conditional on Observed Genetic Data

Author

  • Kelly M. Burkett
Abstract

The gene genealogy is a tree describing the ancestral relationships among genes sampled from unrelated individuals. Knowledge of the tree is useful for inference of population-genetic parameters such as the mutation or recombination rate. It also has potential application in genomic mapping, as individuals with similar trait values will tend to be more closely related genetically at the location of a trait-influencing mutation. One way to incorporate genealogical trees in genetic applications is to sample them conditional on genetic data observed at present. In this thesis, we describe our Markov chain Monte Carlo (MCMC) based genealogy sampler. First, we describe the sampler that conditions on haplotype data. Our implementation is based on the sampler described in Zöllner and Pritchard (2005), with several changes to increase the efficiency of sampling. We illustrate the use of our sampler on haplotype data from a publicly available dataset, where we examine statistics summarizing the degree to which case haplotypes are more related to each other than to control haplotypes. Most genealogy samplers assume that haplotype data for present-day sequences are available. However, commonly used genotyping technology measures genotypes at single loci rather than haplotypes, so the haplotype data need to be imputed. To avoid single imputation, we then describe how we extended the original sampler to handle the case in which only genotype data are available. We apply the sampler to simulated data to evaluate how well it estimates genetic parameters and predicts haplotypes. Adequate mixing of the sampler was a concern for some of the test datasets. The mixing difficulties were attributed to substantial dependence between the tree structure and the latent variables introduced to facilitate sampling of the trees. We describe our experiences with using simulated tempering to improve the mixing of the sampler.
Our heated distributions were chosen so that the dependencies between the latent variables and the tree structure were gradually reduced. We apply this approach to a simulated dataset to illustrate how simulated tempering can improve mixing over the haplotype configurations.
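
As a rough illustration of the simulated tempering idea mentioned in the abstract, the following Python sketch runs a tempered Metropolis sampler on a toy one-dimensional bimodal density. Everything in it is an assumption made for illustration: the toy target, the function names, the inverse-temperature ladder `betas`, and the pseudo-prior `log_weights` are hypothetical, and the heated distributions are simple powers of the target rather than the thesis's construction, which instead gradually reduces the dependence between the latent variables and the tree structure.

```python
import math
import random


def log_target(x):
    """Toy unnormalized log-density: equal mixture of N(-4, 1) and N(4, 1).

    This stands in for a poorly mixing posterior; it is not a genealogy model.
    """
    a = -0.5 * (x + 4.0) ** 2
    b = -0.5 * (x - 4.0) ** 2
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))  # log-sum-exp


def accept(rng, log_ratio):
    """Metropolis acceptance test for a symmetric proposal."""
    return rng.random() < math.exp(min(0.0, log_ratio))


def simulated_tempering(n_iter, betas, log_weights, step=1.0, seed=0):
    """Sample the augmented state (x, k) with density w_k * target(x) ** betas[k].

    betas[0] must be 1.0 (the unheated level). Only draws made while the
    chain sits at level 0 are returned, since those target the original
    distribution.
    """
    rng = random.Random(seed)
    x, k = -4.0, 0
    cold_samples = []
    for _ in range(n_iter):
        # 1. Random-walk Metropolis update of x at the current temperature.
        proposal = x + rng.gauss(0.0, step)
        if accept(rng, betas[k] * (log_target(proposal) - log_target(x))):
            x = proposal
        # 2. Propose moving one level up or down the temperature ladder.
        k_new = k + rng.choice((-1, 1))
        if 0 <= k_new < len(betas):
            log_ratio = (log_weights[k_new] - log_weights[k]
                         + (betas[k_new] - betas[k]) * log_target(x))
            if accept(rng, log_ratio):
                k = k_new
        # 3. Record x only when the chain is at the unheated level.
        if k == 0:
            cold_samples.append(x)
    return cold_samples


if __name__ == "__main__":
    ladder = [1.0, 0.5, 0.25, 0.1]   # assumed inverse temperatures
    weights = [0.0] * len(ladder)    # assumed flat pseudo-prior (log scale)
    draws = simulated_tempering(20000, ladder, weights)
    right = sum(1 for d in draws if d > 0.0) / len(draws)
    print(f"{len(draws)} cold draws, fraction in right-hand mode: {right:.2f}")
```

At the hotter levels the two modes flatten and merge, so the chain can cross between them and carry that move back to the unheated level. In practice the pseudo-prior weights are usually tuned so that the chain spends comparable time at each level; the flat weights here are only a simplification.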

Similar resources

A Markov chain Monte Carlo sampler for gene genealogies conditional on haplotype data

The gene genealogy is a tree describing the ancestral relationships among genes sampled from unrelated individuals. Knowledge of the tree is useful for inference of population-genetic parameters such as migration or recombination rates. It also has potential application in gene-mapping, as individuals with similar trait values will tend to be more closely related genetically at the location of ...

An algorithm to characterize non-communicating classes on complex genealogies

The use of Markov chain Monte Carlo methodology to estimate probability and likelihood functions on complex genealogies has become increasingly popular, providing a practical alternative to methods requiring computation exponentially proportional to the complexity of the pedigree structure. However, in cases with genotypes as latent variables, sampler reducibility can arise as typed individuals...

phylodyn: an R package for phylodynamic simulation

We introduce phylodyn, an R package for phylodynamic analysis based on gene genealogies. The package main functionality is Bayesian nonparametric estimation of effective population size fluctuations over time. Our implementation includes several Markov chain Monte Carlo-based methods and an integrated nested Laplace approximation-based approach for phylodynamic inference that hav...

Scalable Statistical Methods for Ancestral Inference from Genomic Variation Data

By Andrew Hans Chan, Doctor of Philosophy in Computer Science, University of California, Berkeley; Professor Yun S. Song, Chair. Developments in DNA sequencing technology over the last few years have yielded unprecedented volumes of genetic data. The resulting datasets are indispensable for a variety of purposes, from ...

Likelihoods on coalescents: a Monte Carlo sampling approach to inferring parameters from population

Department of Genetics, University of Washington, Box, 7360, Seattle WA 98195-7360. When population samples of molecular data, such as sequences, are taken, the members of the sample are related by a gene tree whose shape is affected by the population processes, such as genetic drift, change of population size, and migration. Genetic parameters such as recombination also affect that genealogy....

Publication date: 2011